Sequential Information Guided Sensing

Ruiyang Song, Yao Xie, Sebastian Pokutta 


Abstract: We study the value of information in sequential compressed
sensing by characterizing the performance of sequential information
guided sensing in practical scenarios when information is inaccurate.
In particular, we assume the signal distribution is parameterized
through Gaussian or Gaussian mixtures with estimated mean and
covariance matrices, and we can measure compressively through a noisy
linear projection or using one-sparse vectors, i.e., observing one
entry of the signal each time. We establish a set of performance bounds
for the bias and variance of the signal estimator via posterior mean,
by capturing the conditional entropy (which is also related to the size
of the uncertainty), and the additional power required due to
inaccurate information to reach a desired precision. Based on this, we
further study how to estimate covariance based on direct samples or
covariance sketching. Numerical examples also demonstrate the superior
performance of Info-Greedy Sensing algorithms compared with their
random and non-adaptive counterparts.

Index Terms: compressed sensing, mutual information, sequential
methods, sketching


I. Introduction

Sequential compressed sensing is a promising new information
acquisition and recovery technique to process big data that arises
in various applications such as compressive imaging [?]-[?],
power network monitoring [?], and large scale sensor networks
[?]. The sequential nature of the problems arises either because the
measurements are taken one after another, or because the data is
obtained in a streaming fashion so that it has to be processed in
one pass.

To harvest the benefits of adaptivity in sequential compressed
sensing, various algorithms have been developed (see [?] for
a review). We may classify these algorithms as (1) being
agnostic about the signal distribution and, hence, using random
measurements [?]-[?]; (2) exploiting additional structure of
the signal (such as graphical structure [?] and tree-sparse
structure [15], [?]) to design measurements; (3) exploiting
the distributional information of the signal in choosing the
measurements, possibly through maximizing mutual information:
the seminal Bayesian compressive sensing work [17], Gaussian
mixture models (GMM) [?], [?], and our earlier work [?],
which presents a general framework for information guided
sensing referred to as Info-Greedy Sensing.


Such additional knowledge about signal structure or distributions
constitutes various forms of information about the unknown signal.
Information may play a distinguishing role: as the compressive
imaging example in Fig. 1 demonstrates (see Section IV-C for more
details), with a bit of (albeit inaccurate) information estimated
via random samples of small patches of the image, Info-Greedy
Sensing is able to recover details of a high-resolution image,
whereas random measurements completely miss the image.


In this paper we examine the value of information in sequential
compressed sensing by considering Info-Greedy Sensing when
information is imprecise. Info-Greedy Sensing is a framework
introduced in [?] that aims at designing subsequent measurements
to maximize the mutual information conditioned on previous
measurements. Conditional mutual information is a natural metric
here, as it captures exclusively useful new information between
the signal and the results of the measurements, disregarding
noise and what has already been learned from previous
measurements. We assume information is parameterized imperfectly and
captured through sample estimates or "sketching", and that
measurements of the unknown signal are compressive or even
one-sparse (we are only able to inspect one entry of the signal).
As shown in [?], Info-Greedy Sensing for a Gaussian signal
becomes a simple iterative algorithm: choose the measurement
as the leading eigenvector of the conditional signal covariance
matrix in that iteration, and then update the covariance matrix via
a simple rank-one update, or, equivalently, choose measurement
vectors $a_1, a_2, \ldots$ as the orthonormal eigenvectors of the signal
covariance matrix $\Sigma$ in a decreasing order of eigenvalues. This
can also be easily generalized to GMM signals, where a heuristic
that works well is to measure in the dominant eigenvector
direction of the Gaussian component with the highest posterior
weight in that iteration.

In practice, we may be able to estimate the signal covariance
matrix to initialize the algorithm through a training session. For
Gaussian signals, there are two possible approaches: either using
training samples that are sampled from the same distribution,
or through the so-called "covariance sketching" [20]-[?] based
on low-dimensional random sketches of the samples. As a
consequence, the measurement vectors are calculated from
eigenvectors of the estimated covariance matrix $\widehat{\Sigma}$, which
deviate from the optimal directions. Since we almost always
have to use an estimate for the signal covariance, it is crucial
to quantify the performance of sensing algorithms with model
mismatch and shed some light on how to properly initialize the
algorithm.





In this paper we characterize the performance of Info-Greedy
Sensing for Gaussian and GMM signals (with possibly low-rank
covariance matrices) when the true signal covariance matrix is
replaced with a proxy, which may be an estimate from direct
samples or using a covariance sketching scheme. We establish
a set of theoretical results including (1) studying the bias and
variance of the signal estimator via posterior mean, by relating
the error in the covariance matrix $\|\widehat{\Sigma} - \Sigma\|$ to the entropy of the
signal posterior distribution after each sequential measurement;
(2) establishing an upper bound on the additional power needed
to achieve the signal precision $\|\hat{x} - x\| < \varepsilon$; and (3) translating
these into requirements on the choice of the sample covariance
matrix through direct estimation or through covariance sketching.
Furthermore, we also study Info-Greedy Sensing in a special
setting when the measurement vector is desired to be one-sparse,
and establish analogously a set of theoretical results. Such a
requirement arises from applications such as nondestructive
testing (NDT) or network tomography. We also present
numerical examples to demonstrate the superior performance
of Info-Greedy Sensing compared to a batch method (where
measurements are not adaptive) when there is mismatch.

Our notations are standard. Denote $[n] = \{1, 2, \ldots, n\}$; $\|X\|$,
$\|X\|_F$, and $\|X\|_*$ represent the spectral norm, the Frobenius
norm, and the nuclear norm of a matrix $X$, respectively; let
$\lambda_i(\Sigma)$ denote the $i$th largest eigenvalue of a positive semi-definite
matrix $\Sigma$; $\|x\|_0$, $\|x\|_1$, and $\|x\|$ represent the $\ell_0$, $\ell_1$ and $\ell_2$ norm
of a vector $x$, respectively; let $\chi_n^2$ be the quantile function of the
chi-squared distribution with $n$ degrees of freedom; let $\mathbb{E}[x]$ and
$\mathrm{Var}[x]$ denote the mean and the variance of a random variable
$x$; we write $X \succeq 0$ to indicate that the matrix is positive
semi-definite; $\phi(x \mid \mu, \Sigma)$ denotes the probability density function of
the multi-variate Gaussian with mean $\mu$ and covariance matrix
$\Sigma$; let $e_j$ denote the $j$th column of the identity matrix $I$ (i.e., $e_j$
is a vector with only one non-zero entry at location $j$); and
$(x)^+ = \max\{x, 0\}$ for $x \in \mathbb{R}$.



[Figure 1 panels: five 238-by-375 low-resolution images captured by
random measurements; recovered 1904-by-3000 high-resolution image.]

Fig. 1: Value of information in sensing a high-resolution image of
size 1904x3000. Here, compressive linear measurements correspond to
extracting the features in compressive imaging [?]-[?]. In this
example, the compressive imaging system captures 5 low-resolution
images of size 238-by-375 using masks designed by Info-Greedy
Sensing or random masks (this corresponds to compressing the data into
8.32% of its original dimensionality). Info-Greedy Sensing performs
much better than random features and preserves richer details in the
recovered image. Details are explained in Section IV-C2.


The goal is to use a minimum number of measurements (or total
power) so that the estimated signal is recovered with precision
$\varepsilon$; i.e., $\|\hat{x} - x\| < \varepsilon$ with a high probability $p$. Define

$$\chi_{n,p,\varepsilon} = \varepsilon^2 / \chi_n^2(p),$$

and we will show in the following that this is a fundamental
quantity that determines the termination condition of our
algorithm to achieve the precision $\varepsilon$ with the confidence level
$p$. Note that $\chi_{n,p,\varepsilon}$ is a precision $\varepsilon$ adjusted by the confidence
level.
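As a small numerical illustration of this threshold, the sketch below evaluates $\varepsilon^2/\chi_n^2(p)$ using only the Python standard library. It is not taken from the paper: the function names are hypothetical, and the chi-squared quantile is replaced by the Wilson-Hilferty approximation (an assumption made here to avoid external dependencies), so the values are approximate.

```python
from statistics import NormalDist


def chi2_quantile_wh(n: int, p: float) -> float:
    """Approximate p-quantile of the chi-squared distribution with n
    degrees of freedom (Wilson-Hilferty approximation, not exact)."""
    z = NormalDist().inv_cdf(p)      # standard normal quantile
    c = 2.0 / (9.0 * n)
    return n * (1.0 - c + z * c ** 0.5) ** 3


def precision_threshold(n: int, p: float, eps: float) -> float:
    """Threshold eps^2 / chi_n^2(p), i.e. the precision eps adjusted
    by the confidence level p."""
    return eps ** 2 / chi2_quantile_wh(n, p)


if __name__ == "__main__":
    # Example values n = 100, p = 0.95, eps = 0.1 (chosen for illustration).
    print(precision_threshold(100, 0.95, 0.1))
```

One way to read the definition: if the estimation error is Gaussian and every eigenvalue of its covariance is at most this threshold, then the squared error is dominated by the threshold times a chi-squared variable with $n$ degrees of freedom, which is below $\varepsilon^2$ with probability at least $p$.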


II. Info-Greedy Sensing 

A typical sequential compressed sensing setup is as follows.
Let $x \in \mathbb{R}^n$ be an unknown $n$-dimensional signal. We make $K$
measurements of $x$ sequentially

$$y_k = a_k^\top x + w_k, \quad k = 1, \ldots, K,$$

and the power of the measurement is $\|a_k\|^2 = \beta_k$. The goal is
to recover $x$ using measurements $\{y_k\}_{k=1}^{K}$. Consider a Gaussian
signal $x \sim \mathcal{N}(0, \Sigma)$ with known zero mean and covariance
matrix $\Sigma$ (here without loss of generality we have assumed
the signal has zero mean). Assume the rank of $\Sigma$ is $s$ and the
signal can be low-rank, $s \ll n$ (however, the algorithm does not
require the covariance to be low-rank). Info-Greedy Sensing [?]
chooses each measurement to maximize the conditional mutual
information

$$a_k \leftarrow \arg\max_{a} \; \mathbb{I}[x; a^\top x + w \mid y_j, a_j, j < k] / \beta_k. \quad (1)$$


A. Gaussian signal 

In [?], we devised a solution to (1) when the signal is
Gaussian. The measurements are made in the directions of
the eigenvectors of $\Sigma$ in a decreasing order of eigenvalues, and
the powers (or the number of measurements) are chosen such that
the eigenvalues after the measurements are sufficiently small
(i.e., less than $\varepsilon$). The power allocation depends on the noise
variance, signal recovery precision $\varepsilon$, and confidence level $p$, as
given in Algorithm 1.
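The Gaussian procedure described above (eigenvector measurements, power allocation, rank-one posterior update) can be sketched in a few lines of NumPy. This is an illustrative sketch rather than the authors' implementation: the function name, the `max_iter` guard, and the small relative tolerance in the stopping test are assumptions added here, and `chi` stands for the threshold $\chi_{n,p,\varepsilon}$.

```python
import numpy as np


def info_greedy_gaussian(x, mu, Sigma, sigma2, chi, rng, max_iter=1000):
    """Sense x with eigenvector measurements until ||Sigma|| <= chi.

    mu, Sigma: assumed mean and covariance; sigma2: noise variance;
    chi: threshold chi_{n,p,eps}. Returns (estimate, total power used).
    """
    mu = np.array(mu, dtype=float)
    Sigma = np.array(Sigma, dtype=float)
    total_power = 0.0
    for _ in range(max_iter):
        evals, evecs = np.linalg.eigh(Sigma)       # eigenvalues in ascending order
        lam, u = evals[-1], evecs[:, -1]
        if lam <= chi * (1.0 + 1e-9):              # all eigenvalues small (with tolerance)
            break
        beta = sigma2 * (1.0 / chi - 1.0 / lam)    # power; positive because lam > chi
        a = np.sqrt(beta) * u
        y = a @ x + rng.normal(scale=np.sqrt(sigma2))   # noisy linear measurement
        Sa = Sigma @ a
        denom = beta * lam + sigma2
        mu = mu + Sa * (y - a @ mu) / denom        # posterior mean update
        Sigma = Sigma - np.outer(Sa, Sa) / denom   # rank-one covariance update
        total_power += beta
    return mu, total_power
```

With this power allocation each measured eigenvalue is driven to the threshold in a single step, so the loop runs once per eigenvalue that starts above `chi`.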

Algorithm 1 Info-Greedy Sensing for Gaussian signals

Require: assumed signal mean $\mu$ and covariance matrix $\Sigma$,
noise variance $\sigma^2$, recovery accuracy $\varepsilon$, confidence level $p$

1: repeat
2:   $(\lambda, u) \leftarrow$ largest eigenvalue and associated normalized eigenvector of $\Sigma$
3:   $\beta \leftarrow \sigma^2 (1/\chi_{n,p,\varepsilon} - 1/\lambda)^+$
4:   $a = \sqrt{\beta}\, u$, $y = a^\top x + w$ {measure}
5:   $\mu \leftarrow \mu + \Sigma a (y - a^\top \mu)/(\beta\lambda + \sigma^2)$ {mean}
6:   $\Sigma \leftarrow \Sigma - \Sigma a a^\top \Sigma/(\beta\lambda + \sigma^2)$ {covariance}
7: until $\|\Sigma\| \le \chi_{n,p,\varepsilon}$ {all eigenvalues small}
8: return signal estimate $\hat{x} = \mu$

B. Gaussian mixture model (GMM) signals

The probability density function of GMM is given by

$$p(x) = \sum_{c=1}^{C} \pi_c\, \phi(x \mid \mu_c, \Sigma_c),$$

where $C$ is the number of classes, and $\pi_c$ is the probability that
a sample is drawn from class $c$. Unlike for Gaussian signals, the
mutual information of GMM has no explicit form. However,
for GMM signals, there are two approaches that tend to
work well: Info-Greedy Sensing derived based on a gradient
descent approach [?], [19], which uses the fact that the gradient of
the conditional mutual information with respect to $a$ is a
linear transform of the minimum mean square error (MMSE)
matrix [24], [25], and the so-called greedy heuristic, which
approximately maximizes the mutual information. The greedy
heuristic picks the Gaussian component with the highest posterior
$\pi_c$ at that moment, and chooses the next measurement $a$ to
be its eigenvector associated with the maximum eigenvalue,
as summarized in Algorithm 2. The greedy heuristic can be
implemented more efficiently than the gradient descent
approach and sometimes has competitive performance (see, e.g.,
[?]).

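Algorithm 2 Heuristic Info-Greedy Sensing for GMM signals

Require: number of components $C$, assumed means $\{\mu_c\}$,
covariances $\{\Sigma_c\}$, initial weights $\{\pi_c\}$, noise variance $\sigma^2$,
confidence level $p$

1: repeat
2:   $c^* \leftarrow \arg\max_c \pi_c$
3:   $(\lambda, u) \leftarrow$ largest eigenvalue and associated normalized eigenvector of $\Sigma_{c^*}$
4:   $\beta \leftarrow \sigma^2 (1/\chi_{n,p,\varepsilon} - 1/\lambda)^+$
5:   $a = \sqrt{\beta}\, u$, $y = a^\top x + w$ {measure}
6:   for $c = 1, \ldots, C$ do
7:     $\mu_c \leftarrow \mu_c + [(y - a^\top \mu_c)/(a^\top \Sigma_c a + \sigma^2)]\, \Sigma_c a$
8:     $\Sigma_c \leftarrow \Sigma_c - \Sigma_c a a^\top \Sigma_c/(a^\top \Sigma_c a + \sigma^2)$
9:     $\pi_c \leftarrow K \pi_c \exp\{-\tfrac{1}{2}(y - a^\top \mu_c)^2/(a^\top \Sigma_c a + \sigma^2)\}$
10:    ($K$: normalizing constant)
11:  end for
12: until $\|\Sigma_{c^*}\| \le \chi_{n,p,\varepsilon}$
13: return signal class $c^* = \arg\max_c \pi_c$, estimate $\hat{x} = \mu_{c^*}$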


C. One-sparse measurement 

The problem of Info-Greedy Sensing with sparse measurement
constraint, i.e., each measurement has only $k_0$ non-zero entries,
$\|a\|_0 = k_0$, has been examined in [?] and solved using outer
approximation (cutting planes). Here we will focus on one-sparse
measurements, $\|a\|_0 = 1$, as it is an important instance arising
in applications such as nondestructive testing (NDT).

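Algorithm 3 Info-Greedy Sensing with sparse measurement
$\|a\|_0 = 1$, for Gaussian signals

Require: assumed signal mean $\mu$ and covariance matrix $\Sigma$,
noise variance $\sigma^2$, recovery accuracy $\varepsilon$, confidence level $p$

1: repeat
2:   $j^* \leftarrow \arg\max_j \Sigma_{jj}$
3:   $a = \sqrt{\beta}\, e_{j^*}$, $y = a^\top x + w$ {measure}
4:   $\mu \leftarrow \mu + \Sigma a (y - a^\top \mu)/(\beta \Sigma_{j^* j^*} + \sigma^2)$ {mean}
5:   $\Sigma \leftarrow \Sigma - \Sigma a a^\top \Sigma/(\beta \Sigma_{j^* j^*} + \sigma^2)$ {covariance}
6: until $\|\Sigma\| \le \chi_{n,p,\varepsilon}$ {all eigenvalues small}
7: return signal estimate $\hat{x} = \mu$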


Info-Greedy Sensing with one-sparse measurements can be
readily derived. Note that the mutual information between $x$ and
the outcome using a one-sparse measurement $y_1 = e_j^\top x + w_1$ is
given by

$$\mathbb{I}[x; y_1] = \frac{1}{2} \ln(\Sigma_{jj}/\sigma^2 + 1),$$

where $\Sigma_{jj}$ denotes the $j$th diagonal entry of matrix $\Sigma$. Hence, the
measurement that maximizes the mutual information is given
by $e_{j^*}$ where $j^* = \arg\max_j \Sigma_{jj}$, i.e., measuring in the signal
coordinate with the largest variance or largest uncertainty. Then
Info-Greedy Sensing measurements can be found iteratively, as
presented in Algorithm 3. Note that the correlation of signal
coordinates is reflected in the update of the covariance matrix:
if the $i$th and $j$th coordinates of the signal are highly correlated,
then the uncertainty in $j$ will also be greatly reduced if we
measure in $i$. A similar algorithm with one-sparse measurement
for GMM signals can be derived, where in each iteration we
select the component with the largest weight and measure in
the signal coordinate with largest variance.
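To make the effect of correlation concrete, the following NumPy sketch performs one iteration of the one-sparse scheme for a Gaussian signal. It is a hedged illustration: the function name and the convention $a = \sqrt{\beta}\, e_{j^*}$ for a measurement of power $\beta$ are assumptions made here, and the power $\beta$ is supplied by the caller.

```python
import numpy as np


def one_sparse_step(mu, Sigma, x, beta, sigma2, rng):
    """One one-sparse measurement: probe the coordinate with the largest
    posterior variance and apply the Gaussian posterior update."""
    j = int(np.argmax(np.diag(Sigma)))              # coordinate with largest variance
    y = np.sqrt(beta) * x[j] + rng.normal(scale=np.sqrt(sigma2))
    Sa = np.sqrt(beta) * Sigma[:, j]                # Sigma @ a for a = sqrt(beta) * e_j
    denom = beta * Sigma[j, j] + sigma2
    mu_new = mu + Sa * (y - np.sqrt(beta) * mu[j]) / denom
    Sigma_new = Sigma - np.outer(Sa, Sa) / denom
    return j, mu_new, Sigma_new
```

For example, with prior covariance [[2.0, 1.8], [1.8, 1.9]], unit power and noise variance 0.1, the step measures coordinate 0, and the variance of the unmeasured coordinate 1 also drops (from 1.9 to about 0.36) because the two coordinates are strongly correlated.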

D. Updating covariance with sequential data 

If our goal is to estimate a sequence of data $x_1, x_2, \ldots$ (versus
just estimating a single instance), we may be able to update the
covariance matrix using the already estimated signals simply via

$$\widehat{\Sigma}_t = \alpha \widehat{\Sigma}_{t-1} + (1 - \alpha)\, \hat{x}_t \hat{x}_t^\top, \quad t = 1, 2, \ldots, \quad (2)$$

and the initial covariance matrix is specified by our prior
knowledge $\widehat{\Sigma}_0 = \widehat{\Sigma}$. Using the updated covariance matrix $\widehat{\Sigma}_t$,
we design the next measurement for signal $x_{t+1}$. This way we
may be able to correct the inaccuracy of $\widehat{\Sigma}$ by including new
samples. We refer to this method as "Info-Greedy-2" hereafter.
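The update in (2) is an exponentially weighted average of rank-one terms built from the recovered signals. A minimal NumPy sketch follows; the function name is hypothetical, and the default $\alpha = 0.5$ is simply the value quoted in the numerical example of Section IV.

```python
import numpy as np


def update_covariance(Sigma_prev, x_hat, alpha=0.5):
    """Blend the previous covariance estimate with the outer product of
    the newly recovered signal, as in the recursive update (2)."""
    x_hat = np.asarray(x_hat, dtype=float)
    return alpha * np.asarray(Sigma_prev, dtype=float) + (1.0 - alpha) * np.outer(x_hat, x_hat)
```

A larger $\alpha$ trusts the prior estimate more; a smaller $\alpha$ adapts faster to the recovered signals but is noisier.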

III. Performance bounds 

In the following, we establish performance bounds, for cases 
when we (1) sense Gaussian and GMM signals using estimated 
covariance matrices; (2) sense Gaussian signals with one-sparse 
measurements. 
[Figure 2 diagram: algorithm parameter versus true parameter.]

Fig. 2: Parameter updates performed by the algorithm, and the updates
that happen on the true distribution.


A. Gaussian case with model mismatch 

To analyze the performance of our algorithms when the
assumed covariance $\widehat{\Sigma}$ used in Algorithm 1 is different from
the true signal covariance matrix $\Sigma$, we introduce the following
notations. Let the eigenpairs of $\Sigma$ with the eigenvalues (which
can be zero) ranked from the largest to the smallest be
$(\lambda_1, u_1), (\lambda_2, u_2), \ldots, (\lambda_n, u_n)$, and let the eigenpairs of $\widehat{\Sigma}$ with
the eigenvalues (which can be zero) ranked from the largest
to the smallest be $(\hat{\lambda}_1, \hat{u}_1), (\hat{\lambda}_2, \hat{u}_2), \ldots, (\hat{\lambda}_n, \hat{u}_n)$. Let the
updated covariance matrix in Algorithm 1 starting from $\widehat{\Sigma}$ after
$k$ measurements be $\widehat{\Sigma}_k$, and the true posterior covariance matrix
of the signal conditioned on these measurements be $\Sigma_k$. The
relations of these notations are illustrated in Fig. 2.

Note that since each time we measure in the direction of
the dominating eigenvector of the posterior covariance matrix,
$(\lambda_k, u_k)$ and $(\hat{\lambda}_k, \hat{u}_k)$ correspond to the largest eigenpair of
$\Sigma_{k-1}$ and $\widehat{\Sigma}_{k-1}$, respectively. Furthermore, define the difference
between the true and the assumed conditional covariance matrices
after $k$ measurements as

$$E_k = \widehat{\Sigma}_k - \Sigma_k, \quad k = 1, \ldots, K,$$

and their sizes

$$\delta_k = \|E_k\|, \quad k = 1, \ldots, K.$$

Let the eigenvalues of $E_k$ be $e_1 \ge e_2 \ge \cdots \ge e_n$; then $\delta_k =
\max\{|e_1|, |e_n|\}$. Let

$$\delta_0 = \|\widehat{\Sigma} - \Sigma\|$$

denote the size of the initial mismatch.

1) Deterministic mismatch: First we assume the mismatch
is deterministic, and find bounds for bias and variance of the
estimated signal. Assume the initial mean is $\hat{\mu}$ and the true
signal mean is $\mu$, the updated mean using Algorithm 1 after $k$
measurements is $\hat{\mu}_k$, and the true posterior mean is $\mu_k$.

Theorem 1 (Unbiasedness). After $k$ measurements, the expected
difference between the updated mean and the true posterior
mean is given by

k 

E[Afc - Mfe] = (A - M) • fliln - 

j = l Pj'^j + ^ 

Moreover, if $\hat{\mu} = \mu$, i.e., the assumed mean is accurate, the
estimator is unbiased throughout all the iterations: $\mathbb{E}[\hat{\mu}_k - \mu_k] =
0$, for $k = 1, \ldots, K$.


Next we show that the variance of the estimator, when
the initial mismatch $\|\widehat{\Sigma} - \Sigma\|$ is sufficiently small, reduces
gracefully. This is captured through the reduction of entropy,
which is also a measure of the uncertainty in the estimator.
In particular, we consider the posterior entropy of the signal
conditioned on the previous measurement outcomes. Since
the entropy of a Gaussian signal $x \sim \mathcal{N}(\mu, \Sigma)$ is given
by

$$\mathbb{H}[x] = \ln\left[(2\pi e)^{n/2} \det{}^{1/2}(\Sigma)\right],$$

the conditional mutual
information is the log of the determinant of the conditional
covariance matrix, or equivalently the log of the volume of the
ellipsoid defined by the covariance matrix. Here, to accommodate
the scenario where the covariance matrix is low-rank (our earlier
assumption), we consider a modified definition for conditional
entropy, which is the log of the volume of the ellipsoid on the
low-dimensional space that the signal lies on:

$$\mathbb{H}[x \mid y_j, a_j, j \le k] = \ln\left[(2\pi e)^{s/2}\, \mathrm{Vol}(\Sigma_k)\right],$$

where $\mathrm{Vol}(\Sigma_k)$ is the volume of the ellipsoid defined by $\Sigma_k$, equal
to the product of its non-zero eigenvalues.


Theorem 2 (Entropy of estimator). If for some constant $\delta \in
(0, 1)$ the error satisfies

11^ ~ ^11 — Xn,p,si 

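then for $k = 1, \ldots, K$,

$$\mathbb{H}[x \mid y_j, a_j, j \le k] \le \frac{s}{2}\left\{\ln[2\pi e\, \mathrm{tr}(\Sigma)] - \sum_{j=1}^{k} \ln(1/f_j)\right\}, \quad (3)$$

where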


= k = l,...,K. (4) 

In the proof of Theorem 2, we use the trace of the underlying
actual covariance matrix $\mathrm{tr}(\Sigma_k)$ as a potential function, which
serves as a surrogate for the product of eigenvalues that
determines the volume of the ellipsoid and hence the entropy,
since it is much easier to calculate the trace of the observed
covariance matrix $\mathrm{tr}(\widehat{\Sigma}_k)$. The following recursion is crucial
for the derivation: for an assumed covariance matrix $\Sigma$, after
measuring in the direction of a unit norm eigenvector $u$ with
eigenvalue $\lambda$ using power $\beta$, the updated matrix takes the form
of

$$\frac{\lambda \sigma^2}{\beta\lambda + \sigma^2}\, u u^\top + \Sigma^{\perp u}, \quad (5)$$

where $\Sigma^{\perp u}$ is the component of $\Sigma$ in the orthogonal complement
of $u$. Thus, the only change in the eigen-decomposition of $\Sigma$
is the update of the eigenvalue of $u$ from $\lambda$ to $\lambda\sigma^2/(\beta\lambda + \sigma^2)$.
Based on (5), after one measurement, the trace of the covariance
matrix becomes

$$\mathrm{tr}(\Sigma_k) = \mathrm{tr}(\Sigma_{k-1}) - \frac{\beta_k \lambda_k^2}{\beta_k \lambda_k + \sigma^2}. \quad (6)$$
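As a short worked check of the eigenvalue update in (5) and the trace recursion in (6), one can substitute the power allocation used in Algorithm 1. The algebra below assumes $\lambda > \chi_{n,p,\varepsilon}$, so that the positive part in the allocation is inactive; it is an illustration added for the reader, not a step quoted from the proof.

```latex
% Power allocation: \beta = \sigma^2 (1/\chi_{n,p,\varepsilon} - 1/\lambda), with \lambda > \chi_{n,p,\varepsilon}.
\beta\lambda + \sigma^2 = \frac{\sigma^2 \lambda}{\chi_{n,p,\varepsilon}},
\qquad
\frac{\lambda\sigma^2}{\beta\lambda + \sigma^2} = \chi_{n,p,\varepsilon},
\qquad
\frac{\beta\lambda^2}{\beta\lambda + \sigma^2} = \lambda - \chi_{n,p,\varepsilon}.
```

In words: under this allocation a single measurement along $u$ brings the corresponding eigenvalue exactly down to the threshold, and the trace decreases by $\lambda - \chi_{n,p,\varepsilon}$.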











Remark 1. The upper bound of the posterior signal entropy in
(3) shows that the amount of uncertainty reduction by the $k$th
measurement is roughly $(s/2) \ln(1/f_k)$.


Remark 2. Using the inequality $\ln(1 - x) \le -x$ for $x \in (0, 1)$,
we have that in (3)


H [x I yj,aj,j < fc] < I ln[27retr(I])] - ^ 

i=l/3iAj+(T2 

= i ln[27retr(I])] - ^ 

j = l 


On the other hand, in the ideal case if the true covariance matrix 
is used, the posterior entropy of the signal is given by 


1 ^ 1 A 

Elideai [x, \yj,aj,j <k] = - ln[(27re)^ JJ A^] - x X] — 

^ j=i ^ j=i 

(7) 

where $\beta_j = \sigma^2 (1/\chi_{n,p,\varepsilon} - 1/\lambda_j)$. Hence, we have

W[x\yj,aj,j < k] 

^ ^ideal [^5 \yj t j — k] C 

k ■- 

- 55 ; 


i=i 


+ (1 - <5) (1 - 


Xn,'j 


( 8 ) 


where $C = \frac{s}{2} \ln\left[\mathrm{tr}(\Sigma) \big/ \sqrt[s]{\prod_{j=1}^{s} \lambda_j}\right]$ is a constant independent
of measurements. This upper bound has a nice interpretation:
it characterizes the amount of uncertainty reduction with each
measurement. For example, when the number of measurements
required when using the assumed covariance matrix versus using
the true covariance matrix are the same, we have $\hat{\lambda}_j \ge \chi_{n,p,\varepsilon}$
and $\lambda_j \ge \chi_{n,p,\varepsilon}$. Hence, the third term in (8) is upper bounded
by $-k/2$, which means that the amount of reduction in entropy
is roughly 1/2 nat per measurement.


Remark 3. Consider the special case where the errors only
occur in the eigenvalues of the matrix but not in the eigenspace
$U$, i.e., $\widehat{\Sigma} - \Sigma = U\, \mathrm{diag}\{e_1, \ldots, e_s\}\, U^\top$ and $\max_{1 \le j \le s} |e_j| = \delta_0$;
then the upper bound in (3) can be further simplified. Suppose
only the first $K$ ($K \le s$) largest eigenvalues of $\Sigma$ are larger
than the stopping criterion $\chi_{n,p,\varepsilon}$ required by the precision, i.e.,
the algorithm takes $K$ iterations in total. Then

H [x I yj,aj,j <k]< Hjdeai [a:, \yj,aj,j < k] 

+ K ln(l + 5 k lxn,p,£) 

S 

+ ln(l + ((^0 + <^K)/Aj). 

The additional entropy relative to the ideal case $\mathbb{H}_{\text{ideal}}$ is
typically small, because 6 k < (according to Lemma^in 
the appendix), (5o is on the order of e^, and hence the second 
term in the appendix is on the order of K^; the third term will 
be small because $\delta_K$ is small compared to $\lambda_j$.


Note that, however, if the power allocations $\beta_i$ are calculated
using the eigenvalues of the assumed covariance matrix $\widehat{\Sigma}$, after
$K = s$ iterations, we are not guaranteed to reach the desired
precision $\varepsilon$ with probability $p$. However, this becomes possible
if we increase the total power slightly. The following theorem
establishes an upper bound on the amount of extra total power
needed to reach the same precision $\varepsilon$ compared to the total
power $P_{\text{ideal}}$ if we use the correct covariance matrix.


Theorem 3 (Additional power required). Assume $K \le s$
eigenvalues of $\Sigma$ are larger than $\chi_{n,p,\varepsilon}$. If

11^ ~ ^11 — ^s+l Xn,p,£i 

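then to reach a precision $\varepsilon$ at confidence level $p$, the total
power $P_{\text{mismatch}}$ required by Algorithm 1 when using $\widehat{\Sigma}$ is
upper bounded by

$$P_{\text{mismatch}} \le P_{\text{ideal}} + \left[\frac{20}{51}\, s + \frac{1}{272}\, K\right] \frac{\sigma^2}{\chi_{n,p,\varepsilon}}.$$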


Remark 4. In a special case when $K = s$ eigenvalues of $\Sigma$ are
larger than $\chi_{n,p,\varepsilon}$, under the conditions of Theorem 3, we have
a simpler expression for the upper bound

$$P_{\text{mismatch}} \le P_{\text{ideal}} + \frac{323}{816}\, \frac{\sigma^2}{\chi_{n,p,\varepsilon}}\, s.$$

Note that the additional power required is quite small and is
only linear in $s$. All other parameters are independent of the
input matrix.


2) Initializing $\widehat{\Sigma}$: In the following we present schemes to
estimate $\Sigma$ to reach the desired precision in Theorem [?]: (1)
using the sample covariance matrix if we are able to obtain full
dimensional training samples; (2) using covariance sketching
to estimate the covariance using random projections of the full
dimensional training samples.

Suppose the sample covariance matrix is obtained from
training samples $\tilde{x}_1, \ldots, \tilde{x}_L$ that are drawn i.i.d. from $\mathcal{N}(0, \Sigma)$,
and $\widehat{\Sigma} = (1/L) \sum_{i=1}^{L} \tilde{x}_i \tilde{x}_i^\top$. Then we need $L$ to be sufficiently
large to reach the desired precision. The following lemma
arises from a simple tail probability bound of the Wishart
distribution (since the sample covariance matrix follows a
Wishart distribution).


Lemma 1 (Initialize with sample covariance matrix). For any
constant $\delta > 0$, we have $\|\widehat{\Sigma} - \Sigma\| \le \delta$ with probability exceeding
$1 - 2n \exp(-\sqrt{n})$, as long as

$$L \ge 4 n^{1/2}\, \mathrm{tr}(\Sigma) \left(\|\Sigma\|/\delta^2 + 4/\delta\right).$$

Lemma 1 shows that the number of measurements needed to
reach a precision $\delta$ for a sample covariance matrix is $O(1/\delta^2)$,
as expected.
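The sample covariance initialization and its spectral-norm error can be checked numerically with a short NumPy sketch. The function names, the diagonal test covariance, and the sample size below are illustrative assumptions, not values from the paper.

```python
import numpy as np


def sample_covariance(samples):
    """(1/L) * sum_i x_i x_i^T for zero-mean training samples stacked as rows."""
    samples = np.asarray(samples, dtype=float)
    return samples.T @ samples / samples.shape[0]


def spectral_error(Sigma_hat, Sigma):
    """Spectral norm of the mismatch between estimate and true covariance."""
    return np.linalg.norm(Sigma_hat - Sigma, 2)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    Sigma = np.diag([2.0, 1.0, 0.5])
    for L in (200, 20000):
        X = rng.multivariate_normal(np.zeros(3), Sigma, size=L)
        print(L, spectral_error(sample_covariance(X), Sigma))
```

Increasing the number of training samples by a factor of 100 should shrink the error by roughly a factor of 10, in line with the $1/\delta^2$ scaling of the required sample size.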

We may also use a covariance sketching scheme similar to that
described in [20]-[?] to estimate $\widehat{\Sigma}$. Covariance sketching is
based on random projections of each training sample, and hence
it is memory efficient when we are not able to store or operate
on the full vectors directly. The covariance sketching scheme
is described below and illustrated in Fig. 3. Assume training
samples $\tilde{x}_i$, $i = 1, \ldots, N$, are drawn from the signal distribution.
Each sample $\tilde{x}_i$ is sketched $M$ times using random sketching
vectors $b_{ij}$, $j = 1, \ldots, M$, through a noisy linear measurement
$(b_{ij}^\top \tilde{x}_i + w_{ijl})^2$, and we repeat this $L$ times ($l = 1, \ldots, L$) and
compute the average energy to suppress noise¹. This sketching
process can be shown to be a linear operator $\mathcal{B}$ applied on the
original covariance matrix $\Sigma$, as shown in Appendix [?]. We
may recover the original covariance matrix from the vector of
sketching outcomes $\gamma$ by solving the following convex
optimization problem

$$\widehat{\Sigma} = \arg\min_{X} \mathrm{tr}(X) \quad \text{subject to } X \succeq 0, \ \|\gamma - \mathcal{B}(X)\|_1 \le \tau, \quad (9)$$

where $\tau$ is a user parameter that depends on the noise level. In
the following lemma, we further establish conditions on the
covariance sketching parameters $N$, $M$, $L$, and $\tau$ so that the
recovered covariance matrix $\widehat{\Sigma}$ may reach the required precision
in Theorem [?] by adapting the results in [22].

Fig. 3: Diagram of covariance sketching in our setting. The circle
aggregates quadratic sketches from branches and computes the average.


Lemma 2 (Initialize with covariance sketching). For any $\delta > 0$,
the solution to (9) satisfies $\|\widehat{\Sigma} - \Sigma\| \le \delta$, with probability
exceeding 1 — ^jn — yjn — 2nexp(—— exp(—ciM), as 
long as the parameters M, N, L and r satisfy the following 


¹Our sketching scheme is slightly different from that used in [22] because
we would like to use the square of the noisy linear measurements (whereas
the measurement scheme in [22] has a slightly different noise model). In
practice, this means that we may use the same measurement scheme in the first
stage as training to initialize the sample covariance matrix.


conditions 


M > Cons, 

. 1/2 .v.N.36M2n2||S|| 24Mn, 

N > 4n^/2tr(E)(-+-), 


( 10 ) 

( 11 ) 


L > max < 


M 


1 




2 6M 2 


4n2||S|r ’ y2[tr(E)/||E||]Mn2'^ ’ r 


( 12 ) 

$$\tau = M\delta/c_2, \quad (13)$$

where $c_0$, $c_1$, and $c_2$ are absolute constants.


B. Gaussian mixture model (GMM) 


We also establish an upper bound on the number of measurements
(or power) required to recover a GMM signal with high
precision when there is model mismatch. The proof follows by
identifying a connection between Info-Greedy Sensing and
the so-called multiplicative weight update (MWU) algorithms
(see, e.g., [?]-[?]). The MWU method is actually a
meta-algorithm and its instantiations span a large family of algorithms.
It has been re-derived under various names in various disciplines.
MWU algorithms maintain a distribution over experts (which
correspond to different Gaussian components in our case) and
form a solution by, e.g., a majority vote or an average over
the solutions suggested by the experts. (Here each Gaussian
component will suggest a sensing vector.) The weights are
updated in each round according to the posterior update. We
will use the Hedge version of MWU in deriving the result.
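For readers unfamiliar with Hedge, a generic single update step is sketched below in plain Python. This is the textbook form of the rule, not code from the paper; the function and argument names are hypothetical. One possible reading, offered here as an interpretation only, is that the component weight update in Algorithm 2 has this shape when the loss of component $c$ is taken to be $(y - a^\top \mu_c)^2/(a^\top \Sigma_c a + \sigma^2)$ and the learning rate is 1/2.

```python
import math


def hedge_update(weights, losses, eta):
    """One Hedge step: scale each expert weight by exp(-eta * loss),
    then renormalize so the weights form a probability distribution."""
    scaled = [w * math.exp(-eta * loss) for w, loss in zip(weights, losses)]
    total = sum(scaled)
    return [s / total for s in scaled]


if __name__ == "__main__":
    # Three experts (components); the first one incurs no loss this round.
    print(hedge_update([1 / 3, 1 / 3, 1 / 3], [0.0, 1.0, 1.0], eta=0.5))
```

Experts that repeatedly explain the measurements poorly lose weight geometrically, which is why the number of rounds needed to single out the correct component grows only logarithmically in the number of components.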


Theorem 4 (GMM with Mismatch). Denote the posterior mean
and covariance of component $c$ after $k$ iterations as $\mu_{c,k}$ and $\Sigma_{c,k}$,
and their perturbed counterparts as $\hat{\mu}_{c,k}$ and $\widehat{\Sigma}_{c,k}$,
respectively. Let $\delta_{c,k} = \|\widehat{\Sigma}_{c,k} - \Sigma_{c,k}\|$. Let $m_c$ be the number of
measurements (or power) required to ensure $\|\hat{x} - x\| < \varepsilon$ with
probability $p$ for a Gaussian signal $\mathcal{N}(\mu_c, \Sigma_c)$ corresponding to
component $c$, for all $c \in [C]$, if we start with sample covariance
matrix $\widehat{\Sigma}_c$. Then if both the mismatch in the initial mean
$|a^\top(\mu - \hat{\mu})|$ and the initial covariance $\|\widehat{\Sigma} - \Sigma\|$ are sufficiently
small so that $\eta_0 = O(\eta)$, then for a signal sampled from the
$c^*$th component, we need at most

$$\sum_{c=1}^{C} m_c + O\!\left(\frac{\ln |C|}{\eta + \eta_0}\right)$$

amount of power to ensure $\|\hat{x} - x\| < \varepsilon$ with probability $p(1 -
\eta - 1/n)$ when sampling from the posterior distribution of $\pi$
with probability. Here $\eta = 1/2$,

Vo = ^ • max{cr^(|p/c| F2nbk)\gk\ + , 


Qk — tl^(/ic*,/c Fc*,k)') 



{^c*,k + ^c*,k-l) + + {aj,{pc* 


Fc* ,/c —1 ))^ • 
Note that here the constants are defined in terms of maximizing
from 1 to $n$. This can be understood as if we run the algorithm
until we have acquired $n$ measurements.

Remark 5. If our goal is to detect the correct component (rather 
than recovering the signal itself), we need at most O 
samples if the true component is c*. 

Remark 6. Compared to the GMM result without mismatch, which
is on the order of $O(\ln |C|/\eta)$, this upper bound actually
requires a smaller number of measurements, $O(\ln |C|/(\eta + \eta_0))$.
This is consistent with our intuition: if our
estimation accuracy for the covariance matrices is low, then
we should not "labor" as much, because the estimation error
creates an "error floor" which does not decrease by making
more measurements, and it is not meaningful to make additional
measurements below the noise floor. Of course, when there is
covariance error, the overall error bound will be higher as well
(which is already captured by a larger error bound).


C. One-sparse measurement 

In the following we provide performance bounds for the case
of one-sparse measurements in Algorithm 3. Assume the signal
covariance matrix is known precisely. Now that $\|a_k\|_0 = 1$,
we have where G {ei, • • • , e^}. Suppose the 

largest diagonal entry of is determined by 

j/e_i = argmaxSJ^ 


From the update equation for the covariance matrix in Algorithm 
IC the largest diagonal entry of can be determined from 


Jk 


arg max 








Jk—lJk 

Let the correlation coefficient be denoted as 




Y'^k)Y^{k) 

^33 


where the covariance of the ith and jth coordinate of x after k 


measurements is denoted as H 


(/c) 


Lemma 3 (One sparse measurement. Recursion for trace of 
covariance matrix). Assume the minimum correlation for the 
kth iteration is G [0,1) such that \ for 

any i G [n]. Then for a constant 7 > 0, if the power of the kth 
measurement pk satisfies pk ^ t (Tmaxt we have 


tr(i;/c) < 


(n — l)p^^ + 1 

n(l + 7) 


tr(i:/e_i). 


(14) 


Lemma 3 provides a good bound for a one-step ahead
prediction for the trace of the covariance matrix, as demonstrated
in Fig. 4. Using the above lemma, we can obtain an upper
bound on the number of measurements needed for one-sparse
measurements.


Theorem 5 (Gaussian, one-sparse measurement). For con¬ 
stant 7 > 0, when power is allocated satisfying Pk > 
< 7^/(7 maxt for k = 1,2,..., K, we have ||x — x|| <5 

with probability p as long as 


K > max 


ln[tr(I])/xn,p,g] 

l-l/[n(l+7)] 



(15) 



Fig. 4: One-step ahead prediction for the trace of the covariance
matrix: the offline bound corresponds to applying (14) iteratively $k$
times, and the online bound corresponds to predicting $\mathrm{tr}(\Sigma_k)$ using
tr(Efe_i). Here n — 100, p = 0.95, £ = 0.1, E = dd'^ + bin where 
[!,••• ,1]". 

The above theorem requires the number of iterations to be
on the order of $\ln(1/\varepsilon)$ to reach precision $\varepsilon$ (recall $\chi_{n,p,\varepsilon} =
\varepsilon^2/\chi_n^2(p)$), as expected. It also suggests a method to allocate
power: set Pk to be proportional to ^^is 

captures the inter-dependence of the signal entries, as these
dependences affect the diagonal entries of the updated
covariance matrix.

IV. Numerical examples 

In the following, we present three sets of numerical examples to
demonstrate the performance of Info-Greedy Sensing when there
is mismatch in the signal covariance matrix, when the signal is
sampled from Gaussian and from GMM models, respectively.

A. Sensing Gaussian with mismatched covariance matrix 

When the assumed covariance matrix for the signal x is 
equal to its true covariance matrix, Info-Greedy Sensing is 
identical to the batch method 02) (the batch method measures 
using the largest eigenvectors of the signal covariance matrix). 
However, when there is a mismatch between the two, Info-Greedy 
Sensing outperforms the batch method due to its adaptivity, as 
shown by the example demonstrated in Fig. 5 (with K = 20).
Further performance improvement can be achieved by updating 
the covariance matrix using estimated signal sequentially such 
as described in 0- Info-Greedy Sensing also outperforms 
the sensing algorithm where the $a_i$ are chosen to be random
Gaussian vectors with the same power allocation, as it uses prior
knowledge (albeit imprecise) about the signal distribution.

Fig. 6 demonstrates that, when there is a mismatch
in the assumed covariance matrix, better performance can be
achieved by making many lower-power measurements rather than
one full-power measurement, because we update the
assumed covariance matrix in between.
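The comparison between Info-Greedy Sensing and random measurements under a mismatched covariance can be sketched as follows. This is a minimal illustration and not the experiment reported in the figures: the dimension, eigenvalue profile, mismatch scale (0.05), noise level, and the helper name `sense` are all assumptions chosen for a quick run.

```python
import numpy as np

rng = np.random.default_rng(0)
n, K, beta, sigma2, trials = 30, 5, 1.0, 0.1, 200

# True covariance dominated by five directions, plus a mismatched (assumed) version
U, _ = np.linalg.qr(rng.standard_normal((n, n)))
eigs = np.concatenate([[10, 8, 6, 4, 2], 1e-3 * np.ones(n - 5)])
Sigma = (U * eigs) @ U.T
R = 0.05 * rng.standard_normal((n, n))
Sigma_hat = Sigma + R @ R.T          # assumed covariance: Sigma + R R^T

def sense(x, Sigma_assumed, K, adaptive):
    """Sequentially measure x and return the posterior-mean estimate
    computed under the assumed prior N(0, Sigma_assumed)."""
    mu, S = np.zeros(n), Sigma_assumed.copy()
    for _ in range(K):
        if adaptive:
            # Info-Greedy: top eigenvector of the current assumed covariance
            _, V = np.linalg.eigh(S)
            a = np.sqrt(beta) * V[:, -1]
        else:
            # Random Gaussian direction scaled to the same power
            a = rng.standard_normal(n)
            a *= np.sqrt(beta) / np.linalg.norm(a)
        y = a @ x + np.sqrt(sigma2) * rng.standard_normal()
        Sa = S @ a
        denom = a @ Sa + sigma2
        mu = mu + Sa * (y - a @ mu) / denom     # posterior mean update
        S = S - np.outer(Sa, Sa) / denom        # posterior covariance update
    return mu

chol = np.linalg.cholesky(Sigma)
err = {True: [], False: []}
for _ in range(trials):
    x = chol @ rng.standard_normal(n)           # draw from the true model
    for adaptive in (True, False):
        err[adaptive].append(np.linalg.norm(x - sense(x, Sigma_hat, K, adaptive)))

print("Info-Greedy mean error:", np.mean(err[True]))
print("random      mean error:", np.mean(err[False]))
```

In this setting the adaptive directions concentrate the measurement power on the high-variance subspace, so the imprecise prior still helps relative to random directions of equal power.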



Fig. 5: Sensing a Gaussian signal of dimension n = 100, when there is
mismatch between the assumed covariance matrix and the true covariance
matrix: $\hat\Sigma \propto \Sigma + RR^\top$, where $R$ is a matrix with each entry $R_{ij} \sim \mathcal{N}(0,1)$
(horizontal axis: ordered trials). We repeat 1000 Monte Carlo trials and for each trial we use
K = 20 measurements. The Info-Greedy-2 method corresponds to the case
where we update the assumed covariance matrix sequentially each time
we recover a signal and a = 0.5.


B. One-sparse measurements 

In this example, we sense a GMM signal with a one-sparse 
measurement. Assume there are C = 3 components and we know 
the signal covariance matrix exactly. We consider two cases of 
generating the covariance matrix for each signal: when the low 



Fig. 6: Comparison of sensing a Gaussian signal with dimension
n = 100 using unit-power measurements along the eigenvector direction,
versus splitting each unit-power measurement into 5 smaller ones, each
with amplitude $\sqrt{1/5}$, and updating the covariance matrix in between.
The mismatched covariance matrix is $\hat\Sigma \propto \Sigma + rr^\top$, where
each entry of $r$ is i.i.d. $\mathcal{N}(0,1)$, and $\Sigma$ is normalized to have unit
spectral norm.


rank covariance matrices for each component are generated
completely at random, and when they have certain structure. Fig. 7
shows the reconstruction error $\|x - \hat{x}\|$, using K = 40
one-sparse measurements for GMM signals. Note that the Info-Greedy
Sensing algorithm with unit power can
significantly outperform the random approach with unit power
(which corresponds to randomly selecting coordinates of the
signal to measure). Fig. 8 compares the misclassification
rate of Info-Greedy Sensing with one-sparse measurements to
that obtained using the full signal vector $x$ for classification. Note
that, interestingly, using K = 50 one-sparse measurements we
can obtain performance very similar to the ideal case, which
can be explained by the fact that we exploit the correlation structure of
the signal.




Fig. 7: Sensing a low-rank GMM signal of dimension n = 100 using
K = 40 measurements with a = 0.001, when the covariance matrices
are generated (a) completely randomly, or (b) having certain structure.
The covariance matrices $\Sigma_c$ are normalized so that their spectral norms
are 1.


Fig. 8: Classifying a signal of dimension n = 100 generated from a
GMM with covariance matrix generated according to $\Sigma \propto RR^\top$,
and the true distribution is $\pi = (0.5, 0.3, 0.2)$. We assume a uniform
initial distribution (1/3, 1/3, 1/3). Misclassification rate versus the
number of measurements K. The ideal case corresponds to the setting where we
observe $x$ and run a quadratic discriminant analysis using the full
vector $x$ (i.e., rather than just observing a noisy version of an entry of
$x$ each time).


C. Real data

1) Sensing of a video stream using Gaussian model: In this
example, we use a video from the Solar Data Observatory. The
frame is of size 232 x 292 pixels. We use the first 50 frames
to form a sample covariance matrix $\hat\Sigma$, and use it to perform
Info-Greedy Sensing on the rest of the frames. We take K = 90
measurements. As demonstrated in Fig. 9, Info-Greedy Sensing
performs much better in that it acquires more information such
that the recovered image has much richer details.


Fig. 9: Recovery of solar flare images of size 224 by 288 with K = 90
measurements and no sensing noise. We used the first 50 frames to
estimate the mean and covariance matrix of a single Gaussian. (a)
original image for the 300th frame; (b) ordered relative recovery error of
the 200th to the 300th frames; (c) recovered 300th frame using
random measurements; (d) recovered 300th frame using Info-Greedy
Sensing.

2) Sensing of a high-resolution image using GMM: We consider
a scheme for sensing a high-resolution image that exploits
the fact that the patches of the image can be approximated using
a Gaussian mixture model, as demonstrated in Fig. [T]. We break
the image into 8 by 8 patches, which results in 89250 patches.
We randomly select 500 patches (0.56% of the total pixels) to
estimate a GMM with C = 10 components, and then,
based on the estimated GMM, initialize Info-Greedy Sensing
with K = 5 measurements and sense the rest of the patches.
This means we can use a compressive imaging system to capture
5 low-resolution images of size 238-by-275 (this corresponds to
compressing the data into 8.32% of its original dimensionality).
With such a small number of measurements, the image recovered
from Info-Greedy Sensing measurements has superior quality
compared with that recovered using random masks.


V. Conclusions and discussions 

In this paper, we have explored the value of information and 
how to use such information in sequential compressive sensing, 
by examining the Info-Greedy Sensing algorithms when the 
signal covariance matrix is not known exactly. We quantify the
algorithm performance in the presence of estimation errors and
when only one-sparse measurements are allowed.

Our results for Gaussian and GMM signals are quite general in 
the following sense. In high-dimensional problems, a commonly 
used low-dimensional signal model for x is to assume the signal 
lies in a subspace plus Gaussian noise, which corresponds to the 
case where the signal is Gaussian with a low-rank covariance 
matrix; GMM is also commonly used (e.g., in image analysis 
and video processing) as it models signals lying in a union of 
multiple subspaces plus Gaussian noise. In fact, parameterizing 
via low-rank GMMs is a popular way to approximate complex 
densities for high-dimensional data. Hence, we may be able 
to couple the results for Info-Greedy Sensing of GMM with 
the recently developed methods of scalable multi-scale density 
estimation based on empirical Bayes p9| to create powerful 
tools for information guided sensing for a general signal model. 
We may also be able to obtain performance guarantees using 
multiplicative weight update techniques together with the error 
bounds in

Appendix A
Covariance sketching

Consider the following setup for covariance sketching. Suppose
we are able to form measurements of the form $y = a^\top x + w$,
as in the Info-Greedy Sensing algorithm.
Suppose there are $N$ copies of a Gaussian signal we would like
to sketch: $x_1, \ldots, x_N$, sampled i.i.d. from $\mathcal{N}(0, \Sigma)$, and
we sketch using $M$ random vectors $b_1, \ldots, b_M$. Then for each
fixed sketching vector $b_i$ and fixed copy of the signal $x_j$, we
acquire $L$ noisy realizations of the projection result $y_{ijl}$ via

$$y_{ijl} = b_i^\top x_j + w_{ijl}, \quad l = 1, \ldots, L.$$

We choose the random sampling vectors $b_i$ as i.i.d. Gaussian
with zero mean and covariance matrix equal to the identity matrix.
Then we average $y_{ijl}$ over all realizations $l = 1, \ldots, L$ to form
the $i$th sketch $y_{ij}$ for a single copy $x_j$:

$$y_{ij} = \frac{1}{L}\sum_{l=1}^{L} y_{ijl} = b_i^\top x_j + \underbrace{\frac{1}{L}\sum_{l=1}^{L} w_{ijl}}_{w_{ij}}.$$

The average is introduced to suppress measurement noise, which
can be viewed as a generalization of sketching using just one
sample. Denote $w_{ij} = \frac{1}{L}\sum_{l=1}^{L} w_{ijl}$, which is distributed as
$\mathcal{N}(0, \sigma^2/L)$. Then we will use the average energy of the sketches
as our data $\gamma_i$, $i = 1, \ldots, M$, for covariance recovery:

$$\gamma_i = \frac{1}{N}\sum_{j=1}^{N} y_{ij}^2.$$

Note that $\gamma_i$ can be further expanded as

$$\gamma_i = \operatorname{tr}(\hat\Sigma_N b_i b_i^\top) + \frac{2}{N}\sum_{j=1}^{N} w_{ij} b_i^\top x_j + \frac{1}{N}\sum_{j=1}^{N} w_{ij}^2, \tag{16}$$

where

$$\hat\Sigma_N = \frac{1}{N}\sum_{j=1}^{N} x_j x_j^\top$$

is the maximum likelihood estimate of $\Sigma$ (and is also unbiased).
We can write this in vector-matrix notation as follows. Let
$\gamma = [\gamma_1, \ldots, \gamma_M]^\top$. Define a linear operator $\mathcal{B}: \mathbb{R}^{n \times n} \to \mathbb{R}^{M}$
such that $[\mathcal{B}(X)]_i = \operatorname{tr}(X b_i b_i^\top)$. Thus, we can write (16) as a
linear measurement of the true covariance matrix $\Sigma$:

$$\gamma = \mathcal{B}(\Sigma) + \eta,$$

where $\eta \in \mathbb{R}^{M}$ contains all the error terms and corresponds to
the noise in our covariance sketching measurements, with the
$i$th entry given by

$$\eta_i = b_i^\top(\hat\Sigma_N - \Sigma) b_i + \frac{2}{N}\sum_{j=1}^{N} w_{ij} b_i^\top x_j + \frac{1}{N}\sum_{j=1}^{N} w_{ij}^2.$$

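Note that we can further bound the $\ell_1$ norm of the error term as

$$\|\eta\|_1 = \sum_{i=1}^{M} |\eta_i| \leq \|\hat\Sigma_N - \Sigma\|\, b + 2\sum_{i=1}^{M} |z_i| + w,$$

where

$$b = \sum_{i=1}^{M} \|b_i\|^2, \quad \mathbb{E}[b] = Mn, \quad \operatorname{Var}[b] = 2Mn,$$

$$z_i = \frac{1}{N}\sum_{j=1}^{N} w_{ij} b_i^\top x_j, \qquad w = \frac{1}{N}\sum_{i=1}^{M}\sum_{j=1}^{N} w_{ij}^2.$$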



We may recover the true covariance matrix from the sketches 7 
using the convex optimization problem 0. 
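The construction of the sketches can be illustrated numerically. The following is a sketch under stated assumptions: the sizes `n`, `M`, `N`, `L`, the noise level `sigma`, and the example covariance are hypothetical, and only the formation of $\gamma_i$ and the error term $\eta_i$ is shown (the convex recovery step is not implemented here).

```python
import numpy as np

rng = np.random.default_rng(1)
n, M, N, L, sigma = 8, 40, 2000, 10, 0.5

A = rng.standard_normal((n, n))
Sigma = A @ A.T / n                         # true covariance (hypothetical example)
X = rng.multivariate_normal(np.zeros(n), Sigma, size=N)   # rows are the copies x_j
B = rng.standard_normal((M, n))             # rows are the sketching vectors b_i

# y_ij: average over L noisy realizations of b_i^T x_j + w_ijl
proj = B @ X.T                              # shape (M, N), entries b_i^T x_j
w_avg = sigma * rng.standard_normal((M, N, L)).mean(axis=2)  # w_ij ~ N(0, sigma^2 / L)
Y = proj + w_avg

gamma = (Y ** 2).mean(axis=1)               # average sketch energy, one value per b_i
linear_part = np.einsum("ia,ab,ib->i", B, Sigma, B)   # tr(Sigma b_i b_i^T)
eta = gamma - linear_part                   # error term in gamma = B(Sigma) + eta

print("mean |eta_i|:", np.abs(eta).mean())
print("noise floor sigma^2 / L:", sigma**2 / L)
```

With these sizes the error term is small relative to $\operatorname{tr}(\Sigma b_i b_i^\top)$; increasing $N$ shrinks the contribution of $\hat\Sigma_N - \Sigma$, while increasing $L$ shrinks the noise contribution.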


Appendix B 
Backgrounds 


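Lemma 4 (Eigenvalue of perturbed matrix pQ|). Let $\Sigma$,
$\hat\Sigma \in \mathbb{R}^{n \times n}$ be symmetric, with eigenvalues $\lambda_1 \geq \cdots \geq \lambda_n$ and
$\hat\lambda_1 \geq \cdots \geq \hat\lambda_n$, respectively. Let $E = \hat\Sigma - \Sigma$ have eigenvalues
$e_1 \geq \cdots \geq e_n$. Then for each $i \in \{1, \ldots, n\}$, the perturbed
eigenvalues satisfy $\hat\lambda_i \in [\lambda_i + e_n, \lambda_i + e_1]$.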
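The eigenvalue interval in Lemma 4 can be verified on a random instance. This is a minimal check, assuming the lemma reads $\hat\lambda_i \in [\lambda_i + e_n, \lambda_i + e_1]$ with $E = \hat\Sigma - \Sigma$; the matrices below are arbitrary examples.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 6
A = rng.standard_normal((n, n))
Sigma = A @ A.T                             # symmetric matrix
P = rng.standard_normal((n, n))
E = 0.1 * (P + P.T)                         # symmetric perturbation
Sigma_hat = Sigma + E

# eigvalsh returns ascending eigenvalues; reverse to match lambda_1 >= ... >= lambda_n
lam = np.linalg.eigvalsh(Sigma)[::-1]
lam_hat = np.linalg.eigvalsh(Sigma_hat)[::-1]
e = np.linalg.eigvalsh(E)[::-1]

tol = 1e-10
inside = bool(np.all((lam_hat >= lam + e[-1] - tol) & (lam_hat <= lam + e[0] + tol)))
print("all perturbed eigenvalues inside the interval:", inside)
```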


Lemma 5 (Stability conditions for covariance sketching |p2|). 
Denote A : ^ a linear operator and for X G 

A{X) = {afXai}'^^. Suppose the measurement is contam¬ 
inated by noise p G R^, i.e., Y = w4(E) + p and assume 
ll^lli < ^ 1 . Then with probability exceeding 1 — exp(—cim) the 
solution E to 



for all E G provided that m > c^nr. Here cq, ci, and 

C 2 are absolute constants and represents the best rank-r 
approximation of E. When is exactly rank-r 


||S-S||f <co-. 

m 


Lemma 6 (Concentration of measure for Wishart distribution 


(13). IfX€ M”""” 


^ Wn{X, E), then for t > 0, 



where $\theta = \operatorname{tr}(\Sigma)/\|\Sigma\|$.

Appendix C
Proofs


A. Gaussian signal, with mismatch 

Proof of Theorem^ Let = t^k — h'k- From the up¬ 
date equation for the mean jlk = yU/c-i + T^k-iCik{yk — 
a^jlk-i)/{a[T^k-iCik + since Ok is eigenvector of S/c-i, 

we have the following recursion: 

^ _/r ^kCikCik 

4/c — Gn - - -)^k-l 


Z^/cA/c + 


+ 


-A/c 


alEk-iOk 


+ - 


{PkXk + - alEk-iak){l3kXk + cr^) 

Ek-iCik 


dk 


PkXk +cr2 -alEk-idk 


{al{x - fik-i) Ewk). 

(17) 

From the recursion of in for some vector Ck defined 
properly, we have that 


E[a] = {I ■ 


XkPk 


PkXk + O" 


-Ukul)E[^k-i 


-h Ck 'E[al{x - jik-i) + . 


( 18 ) 


and Cauchy-Schwartz inequality \\AB\\ < ||A||||5||, we have 
PkXkdkEk — lCik 


^k ^ ^k-1 + 


+ 


{PkXk + d‘^){PkXk + - alEk-ia) 

1 


• \\AkE^k-i\\ 


^ ^k-l + 


PkXk + -alEk-idk 

• [A/c(||^/cFZ/c-i II + ||L^/c_i A/c II) + ||L^/c_i A/cL^/c_i I 


<(1 + 
+ 


{PkXk + 0-‘^){PkXk + Cr2 - ^/c4_i) 
[‘^XkSk-i + 


^/cA/c + cr^ — Pk^k-l 
3/3/c A/c 


^/cA/c + cr^ - ^/c(3/c_i 

Pk .2 


^.-1- 


/3/cA/c + cr^ - Pk^k-1 

if set 5k-i < 3cr^/(4/ 

inequality can be upper bounded by 


Hence, if set 5k-i < 3cr^/(4/3/c), i.e., 5k-iPk < fcr^, the last 


(1 + 3 • )4-i + 3 • —^- 44-1- 

^/cA/c + cr2/4 /3/cA/c + crV4 

Hence, if 4-i < 3cr^/(4/3/c), we have 4 < 44-i- 


□ 


Note that the second term is equal to zero using an argument
based on iterated expectation:

$$\mathbb{E}[a_k^\top (x - \mu_{k-1}) + w_k] = a_k^\top \mathbb{E}\big[\mathbb{E}[x - \mu_{k-1} \mid y_1, \ldots, y_k]\big] = 0.$$

Hence the theorem is proved by iteratively applying the recursion
(18). When $\hat\mu_0 - \mu_0 = 0$, we have $\mathbb{E}[\hat\mu_k - \mu_k] = 0$, $k = 0, \ldots, K$.

□ 


Lemma 8 (Recursion for trace of the true covariance matrix). 

4-1 < A/c, 


tr(i:/c) < tT{Ek-i) - 


f^kXl 


3/3/cA/c4-i 


PkXk + /3/cA/c + cr2 - /3/c4-l 

( 21 ) 


In the following. Lemma [7] to Lemma are used to prove 
Theorem [J] 

Lemma 7 (Recursion in covariance matrix mismatch). If
$\delta_{k-1} \leq 3\sigma^2/(4\beta_k)$, then $\delta_k \leq 4\delta_{k-1}$.
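Lemma 7 can be probed numerically by running the assumed and the true covariance recursions side by side. The sketch below is illustrative: it assumes $\delta_k = \|\hat\Sigma_k - \Sigma_k\|$ (spectral norm), measurement vectors $a_k = \sqrt{\beta_k}\,u_k$ with $u_k$ the top eigenvector of the assumed covariance, and the rank-one posterior updates (19) and (20) used in the proof; the dimension, power, noise level, and mismatch size are arbitrary choices.

```python
import numpy as np

rng = np.random.default_rng(3)
n, K, beta, sigma2 = 10, 6, 1.0, 1.0

A = rng.standard_normal((n, n))
Sigma = A @ A.T / n                         # true covariance
P = rng.standard_normal((n, n))
Sigma_hat = Sigma + 0.01 * (P + P.T)        # assumed covariance with small mismatch

ratios = []
for _ in range(K):
    delta_prev = np.linalg.norm(Sigma_hat - Sigma, 2)
    lam, V = np.linalg.eigh(Sigma_hat)
    a = np.sqrt(beta) * V[:, -1]            # a_k along the top eigenvector of the assumed covariance
    # (19): update of the assumed covariance (a^T Sigma_hat a = beta * lam_max)
    Sh_a = Sigma_hat @ a
    Sigma_hat = Sigma_hat - np.outer(Sh_a, Sh_a) / (beta * lam[-1] + sigma2)
    # (20): update of the true covariance under the same measurement vector
    S_a = Sigma @ a
    Sigma = Sigma - np.outer(S_a, S_a) / (a @ S_a + sigma2)
    delta = np.linalg.norm(Sigma_hat - Sigma, 2)
    # Record the growth ratio only when the lemma's condition holds
    if delta_prev <= 3 * sigma2 / (4 * beta):
        ratios.append(delta / delta_prev)

print("largest observed delta_k / delta_{k-1}:", max(ratios))
```

In runs of this kind the observed ratio is typically far below the worst-case factor of 4, which is consistent with the lemma being a conservative bound.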

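Proof. Let $A_k = a_k a_k^\top$. Hence, $\|A_k\| = \beta_k$. Recall that $a_k$ is
the eigenvector of $\hat\Sigma_{k-1}$. Using the definition $E_k = \hat\Sigma_k - \Sigma_k$,
together with the recursions of the covariance matrices

$$\hat\Sigma_k = \hat\Sigma_{k-1} - \hat\Sigma_{k-1} a_k a_k^\top \hat\Sigma_{k-1}/(\beta_k\hat\lambda_k + \sigma^2), \tag{19}$$

$$\Sigma_k = \Sigma_{k-1} - \Sigma_{k-1} a_k a_k^\top \Sigma_{k-1}/(a_k^\top \Sigma_{k-1} a_k + \sigma^2), \tag{20}$$

we have

$$E_k = E_{k-1} + \frac{\Sigma_{k-1} a_k a_k^\top \Sigma_{k-1}}{a_k^\top \Sigma_{k-1} a_k + \sigma^2} - \frac{\hat\lambda_k a_k a_k^\top \hat\Sigma_{k-1}}{\beta_k\hat\lambda_k + \sigma^2}.$$

Based on this recursion, using $\delta_k = \|E_k\|$, the triangle inequality,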


Proof Let A/c = aka\. Using the definition of Ek and the 
recursions and ( [2Q| ), the perturbation matrix Ek after k 
iterations is given by 

alEk-idk 


Ek — Ek-i + Xj^Ak 


Xk 


{PkXk + cr‘^){PkXk + 0-2 - alEk-idk) 

• {AkEk-i + Ek-iAk) 


PkXk + 0-2 - alEk-idk 

H“ —- Ek—^AkEk—^- 

PkXk + 0-2 - alEk-idk 


( 22 ) 


Note that $\operatorname{rank}(A_k) = 1$, thus $\operatorname{rank}(A_k E_{k-1}) \leq 1$; therefore it
has at most one nonzero eigenvalue, and

$$|\operatorname{tr}(A_k E_{k-1})| = |\operatorname{tr}(E_{k-1} A_k)| \leq \|A_k E_{k-1}\| \leq \|A_k\|\|E_{k-1}\| = \beta_k \delta_{k-1}.$$

Note that $E_{k-1}$ is symmetric and $A_k$ is positive semi-definite, so
we have $\operatorname{tr}(E_{k-1} A_k E_{k-1}) \geq 0$. Hence, from (22) we have
iT{Ek) = tr(£k) - tr(i:k) 


> tr(£’/e_i) - 

> tr(£’/c-i) - 


3/3/c A/c(^/eA/c + ^)5k-l 

(^/cA/c + cF‘^){Pk^k + - /3/c4-i) 

3/3/cA/c^/c-i 


/3/cA/c + cr^ — /3/c^/c-i 

After rearranging terms we obtain 

tr(i:/c) < tr(i:/c_i) + [tr(£/c)-tr(£/c_i)] + - 


3/3/cA/c^/c-i 


^/cA/c + Cr^ — /3/c(3/c_i 

Together with the recursion for trace of tr(Il/c) in we have 

3^/cA/c(3/c-i 


tr(i:/c) < tT{Ek-i) - 




Pk^k + Pk^k + - /3/c4-l 


□ 


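Lemma 9. For a given positive semi-definite matrix $X \in \mathbb{R}^{n \times n}$
and a vector $h \in \mathbb{R}^{n}$, if

$$Y = X - \frac{1}{h^\top X h + \sigma^2} X h h^\top X,$$

then $\operatorname{rank}(X) = \operatorname{rank}(Y)$.


Proof. Apparently, for all $x \in \ker(X)$, $Yx = 0$, i.e., $\ker(X) \subset \ker(Y)$.
Decompose $X = Q^\top Q$. For all $x \in \ker(Y)$, let $b = Qh$,
$z = Qx$. If $b = 0$, $Y = X$; otherwise, when $b \neq 0$, we have

$$0 = x^\top Y x = z^\top z - \frac{z^\top b b^\top z}{b^\top b + \sigma^2}.$$

Thus,

$$z^\top z = \frac{z^\top b b^\top z}{b^\top b + \sigma^2} \leq \frac{b^\top b}{b^\top b + \sigma^2}\, z^\top z.$$

Since $b^\top b/(b^\top b + \sigma^2) < 1$, it follows that $z = 0$, i.e., $x \in \ker(X)$, so $\ker(Y) \subset \ker(X)$. This shows
that $\ker(X) = \ker(Y)$, or equivalently $\operatorname{rank}(X) = \operatorname{rank}(Y)$.

□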
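The rank-preservation property in Lemma 9 is easy to confirm on a random low-rank example. This is a small numerical check, not part of the proof; the dimension `n`, rank `r`, and noise variance `sigma2` are arbitrary, and the check requires `sigma2 > 0` as in the lemma.

```python
import numpy as np

rng = np.random.default_rng(4)
n, r, sigma2 = 8, 3, 0.5

Q = rng.standard_normal((r, n))
X = Q.T @ Q                                  # positive semi-definite with rank r
h = rng.standard_normal(n)

Xh = X @ h
Y = X - np.outer(Xh, Xh) / (h @ Xh + sigma2)  # the update in Lemma 9

rank_X = np.linalg.matrix_rank(X)
rank_Y = np.linalg.matrix_rank(Y)
print("rank(X) =", rank_X, "rank(Y) =", rank_Y)
```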


Proof of Theorem Recall that for /c = 1,..., if, A/c > Xn,p,s^ 
Using Lemma we can show that for some 0 < (3 < 1, if 
4 ^ ^Xn,p,£/^^^ Y: 3cr^/(4^+^/3i), (the second inequality 
comes from the fact that {l/xn,p,£ — 1 /Ai)xn,p,ecr^ < 3cr^), 
then for the first K measurements. 


4 < 


1 ^Xn,p,£ 

/^K—k-\-l ^ 


^ 1 3(7^ 

- 


k = f...,K. 


Clearly, 


4 — 1 ^Xn,p,e/lfi- 


Hence, 

(4 + 6)6k-i < SXk- 


Note that 44-i < and |A/c — A/c| < 4-i, we have 


/3/cA/c < /3/c(A/c + 4-i) < /3/cA/c + cr^. 


Thus, 

44-i(/3/cA/c + cr^) + dPk^k^k-i ^ <3A/c(/3/cA/c + cr^). 

Then we have 

34A/c4-i(4A/c+cr^) < 4A/c(^A/c-4-i)(/3/cA/e+cr^-/3/c4-i), 


which can be rewritten as 

3/3/c A/c4-i . /3/cA/c ^ 

-S -i^^/c — ^k — 1 

I3k>^k + 0-2 - /?fe4_i /3kXk + cr2 

Hence, 


3/3/c A/c4—i 

Pk^k Y Pk^k — 1 


< 


/3/cA/c 
4A/C + 


[(^ — l)A/c + A/c], 


which can be written as 


/3/cA| ^_ 3/3/cA/c4-i ^ Q 4A/c 

4A/C + 4A/C + cr2 - /3/c4-l ~ /3/cA/c + cr^ ^ 

By applying Lemma we have 


tr(S/c) < tr(F/c_i) - (1 - S)^ --A/c 

4A/C + 

< tr(I]fe_i) - (1 - (5) ^ ^ ^ 

/3/cA/c + cr^ 

where we have used the definition for fk in 0 . Subsequently, 

k 

tr(I]fc) < (H /i)tr(So). 
i=i 

Lemma [9] shows that the rank of the covariance will not 
be changed by updating the covariance matrix sequentially: 
rank(Fi) = • • • = rank(F/c) = s. Hence, we may decom¬ 
pose the covariance matrix H/c = QQ^, with Q G 
being a full-rank matrix, then \/o\{Yk) = det((5''’Q). Since 
tT{Q'^Q) = tT{QQ'^), we have 


( 1 ) ^ 

VoP(Efe) = det(QTQ) < p[(QTg)^.^. 

i=i 

(2) j^ tr(gTQ) y _ ^ tr(Fk) 


where (1) follows from the Hadamard’s inequality and (2) follows 
from the mean inequality. Finally, we can bound the conditional 
entropy of the signal as 

B.[x\yj,aj,j <k] = ln(27re)®/^Vol(Efe) 

k 

< |ln{27re(P[/j)tr(Eo)}, 

i=i 

which leads to the desired result. □ 


Proof of Theorem^ Recall that rank(i;) = s, and hence 
A/c = 0, k = ^ + 1,... ,n. Note that for each iteration, the 
eigenvalue of S/c in the direction of a/c, which corresponds to 
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the largest eigenvalue of is eliminated below the threshold 
Xn,p,£- Therefore, as long as the algorithm continues, the largest 
eigenvalue of is exactly the {k l)th largest eigenvalue of 
S. Now if 

<5o < (23) 

using Lemma and Lemma we have that 

|A/e - Xk\ < So, for A: = 1,... ,5, 

1^1 < ^^0 < Xn,p,s -Ss, for A: = s + 1,... ,n. 

In the ideal case without perturbation, each measurement 
decreases the eigenvalue along a given eigenvector to be below 
Xn,p,£- Suppose in the ideal case, the algorithm terminates at 
K < s iterations, which means 

Ai > • • • > Al > Xn,p,£ > Ak+i(I 1) > • • • > As(i;), 

and the total power needed is 


power needed is upper bounded as 


-^mismatch -Pideal 

f K 




1 1 


, 20(5-i^) 1 

^ Xn,p,£ A| 


51 


Xn,p,£ 


f 1 Xn,p,£ 205 — 3K 1 

/c=l 




Xn^p^e 


/'20 ,3 1 

^ (51® ^51 4*+i^ 

/20 1 0-2 
-\51 272 ) xn,p,s 


a 


Xn,p,£ 


where we have again used E ^^Xn,p,£/^^~^^ = 

Xn,p,s/^, 1/A/c - 1/A/c < So/Xl, the fact that Xk > Xn,p,s 
for A: = □ 


K 


Tideal — ^ ^ ^ 
k=l 




(24) Proof of Lemma It is a direct consequence of Lemma Let 
0 = tr(E)/||E|| > 1. For some constant (5 > 0, set 


On the other hand, in the presence of perturbation, the 
algorithm will terminate using more than K iterations since 
with perturbation, eigenvalues of E that originally below Xn,p,e 
may get above Xn.p.e^ In this case, we will also allocate power 
while taking into account the perturbation: 

Pk = <y^ (- 

\Xn,p,£ Ss A/c/ 

This suffices to eliminate even the smallest eigenvalue to be 
below threshold Xn,p,£ since 


cr^A/c-i 


Pk-lXk-1 +cr2 


— Xn,p,£ Sg < Xn,'j 


We first estimate the total amount of power used at most to 
eliminate eigenvalues A/^, for iX + 1 < A: < 5: 


Pk = cr^(l/(Xn,p,£ - - 1/A/c) 

— ^ (1/ {Xn,p,£ ~ Sg) — 1/ {Xn,p,£ H“ ^ 0 )) 

^ ^2 (4^ + 1)^0 < 20 

{Xn,p,£ 4'®(5o) (Xn,p,e “1“ ^o) bl Xn,p,£ 

where we have used the fact that Sg < 4^(5o (a consequence of 
Lemma 0, the assumption ( [^ , and monotonicity of the upper 
bound in 5. The total power to reach precision 5 in the presence 
of mismatch can be upper bounded by 


^mismatch E ^ ^ Pk 
k=l 


E CT 


2 



1 \ 20(5 - K) 

Xk J ^1 Xn,p,£ 


In order to achieve precision 5 and confidence level p, the extra 


L > 4n^/V(E)(||E||/(5^ +4/(5). 

Then from Lemma we have 

P{||E-E||<( 5 } 

> P{||S - Sll < (^fn^/^0 + l)/L + 29n^/^lLj ||I]||} 

> 1 — 2nexp(—Vn)- 


□ 


The following Lemma is used in the proof of Lemma 

Lemma 10. For the setup in Section ^ if for some constant M, 
N and L satisfies the conditions in Lemma^ then ||77||i < r 
with probability exceeding 1 — 2/n — 2}xjn — 2nexp(—ciM) 
for some universal constant ci > 0. 


Proof Let 0 = tr(E)/||E||. From Chebyshev’s inequality, we 
have that 

36MV2tr(E) 




and 


NLt"^ 




,, I T, 72(T^M 

P{|„| <M-+ -}>!- 


P{|6| < (M + \/M)n} > 1 - 


When 


„ > ,25) 

r 

with the concentration inequality for Wishart distribution in 
Lemma and plugging in the lower bound for N in ( [^ and 
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the definition for r in ( p3] ) we have 

¥{\\^n - S|| < r/[3n{M + Vm)]} 


>P{||S^-S||<( 
> 1 — 2nexp(—\/n). 


2nV26l 

N 


+ 


N 




Furthermore, when L satisfies ( p^ , we have 

P{|6| < (M + \/M)n} > 1 - 

n 

Therefore, ||7^||i < r holds with probability at least 1 — 2/n — 
2/yTi — 2nexp(—yTi). □ 


Proof of the covariance sketching lemma. With Lemma 10, let $\tau = M\delta/c_2$; the
choices of $M$, $N$, and $L$ ensure that $\|\eta\|_1 \leq M\delta/c_2$ with
probability at least $1 - 2/n - 2/\sqrt{n} - 2n\exp(-\sqrt{n})$. By Lemma
5 and noting that the rank of $\Sigma$ is $s$, we have $\|\hat\Sigma - \Sigma\|_F \leq \delta$.
Therefore, with probability exceeding $1 - 2/n - 2/\sqrt{n} - 2n\exp(-\sqrt{n}) - \exp(-c_0 c_1 n s)$,

$$\|\hat\Sigma - \Sigma\| \leq \|\hat\Sigma - \Sigma\|_F \leq \delta.$$

□


The proof of Theorem will use the following two lemmas. 

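Lemma 11 (Moment generating function of multivariate Gaussian
p^). Assume $X \sim \mathcal{N}(0, \Sigma)$. The moment generating
function of $\|X\|_2^2$ is

$$\mathbb{E}\big[e^{s\|X\|_2^2}\big] = \frac{1}{\sqrt{\det(I - 2s\Sigma)}}, \quad s < \frac{1}{2\|\Sigma\|}.$$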


Proof of Theorem We adapt the technique used in @ for 
proving performance bound for GMM signal without mismatch. 
Suppose the true signal is generated from the c*th component. 
First, apply measurements to each component c G [C]. Clearly, 
spending a total amount of power X]c=i would suffice to 
ensure that the norm of covariance of each individual component 
is below Xn,p,£- In the ideal case, the weight is updated in the 
following manner: 


In the presence of mismatch, this becomes 


^/c + l 


'Lfcexpj-- — 


1 {Uk - aljlc,k-if 


^j^^c,k — l^k A’ < 


C€[C]. 


}, c&[C]. 


The Lk and for /c = 1, 2,... are normalization coefficients. 


After m measurements, 

m C 


1 

m 

1 

m 


{Vk ~ 

/ -j / -j T G' I 2 

k=l 1=1 ^k^^^k — l^k I CT 

m C / T ^ \2 

[Vk ~ 

/ ^ / -j T G I 9 

k=l i=l ^k^^ik — l^k “1“ ^ 

^ iVk — QfcAc*,fc-i)^ 
fc=i afe£c»,fe-iafe+0-2 


^ _ (yk 

^ ajEc.,fc_iafe + 0-2 

771/ T ^ \9 

-k iVk - (^ki^c*,k-l) 

k=l ^k^^* ^k—l^k “1“ ^ 

_ {Vk - Cllhc-^k-lY ^ 

a\T^c-,k-icik + 


<fjA- 


2 \n\C\ 

m 



k=l 


{Vk ik — l) 

alT^c*,k-iCik + 


{Vk ^kTc* ,k—l) 

QjjfLc* k — l^k “1“ 


Now we study bound for each individual term inside the sum 
over k. To simplify notation, we omit the dependence on k, 
c* and k — 1 without causing confusion. In the following let 
z = y — y = a'^{x — ji) ^ w, and let g = a’^{y — y). Hence, 
1^1 < \/3\ ' \fi — y\ is bounded. Note that 


i_ (y - _ {y - 

aTSa + (T2 aTl]a + (j2 

+ (y-aT^)2|^| 

2 \g\-\z-g/ 2 \ ^ 

- -+ W 

\ 0 \^ + 2 \g\\z\ 2/3<5 

— + ji- 

(26) 


{y - {y - 


aTSa + 


< 


< 


a'^'Ea o 
[y-a^jlf [y-a^gf 


aT Sa + cr^ aT Sa + cr^ 

2|p| • |aT(x - g) - g/z 

cr^ 


Since m < n, from ( [26] ) we have 

iVk - alfic-,k-i)‘^ iVk - alMc%fe-i)^ 


fe=i 


< max 

k=l 


a^Sc*,/c-iU/c + 

{yk - algc*,k-i)^ {yk - Qfe/ic%/c-i)^ 


alT^c* ,k-lCtk A 0-‘^ 


- :;4 • max{cr^(|p/,| ^2\zk\)\gk\ +/5/ck/cp4-i} • 


a‘ 


Note that Zk = al{x - gc*,k-i) ^ Wk ^ A/'(a^(/ic* - 
he*,k-i)^ CLk^c*,k-iO'k + cr^), SO for some t G (0,1), we have 

\^k\ 

< ~yj[{^c*,k + ^k-l)^k + + (a^(/ic* — gc*,k-l))‘^ — — 

with probability exceeding 1 — where bk is bounded. 
Finally,

m C 


and for i 7 ^ jk-i, 


1 


m 


{Vk — 

/c=l ^=1 ^/c —1^/c “1“ 


_ (^fe -Q^feMc*,fe-l)^ 
^^2 ^c* ,k — l^k H“ 


< 




^ 21n|C| 1 ^[ 2/11 I o ^k i: \ 

-^-(|^/c| + 2 —)|^/c| +/3/e^(5/c_i > 

777/ (T k —1 I 6 Z/ I 


s;‘:.‘j...+</VA 


<S!" 


,1-., -/’''•'*')+■/"/& 


^0 


sSt'i-. +<'VA 


with probability at least 1 — 77 /f^. Let 


^ ^(fc-i) i +7 


1+7 


A = max{max(£>fe,4_i)}, 

/c=l 

and let 

U = \ ■ max {cr^(A + 2nbk) + Pkn^bl} . 
cr^ k=i 

We choose t = Ijn. Then 

Vo = UA. 

Note that when there is no mismatch, 6 k-i = Qk = 0 for /c G [n], 
which leads to A = 0 and thus 7^0 = 0. Here 77 = 1/2 is a 
parameter used in the multiplicative weight update algorithm. 
In particular, we can identify the correct component c* with 
probability 1 — 1/n whenever m = O (ln|C|/(77 + 770 )). For 
/c = 1,..., 777 /, we choose /3/c = 1. Thus, we need at most 
c 


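Therefore,

$$\operatorname{tr}(\Sigma_k) \leq \left(1 - \frac{(\rho^{(k-1)})^2}{1+\gamma}\right)\operatorname{tr}(\Sigma_{k-1}) - \frac{1 - (\rho^{(k-1)})^2}{1+\gamma}\,\Sigma^{(k-1)}_{j_{k-1} j_{k-1}} \leq \left[1 - \frac{(n-1)(\rho^{(k-1)})^2 + 1}{n(1+\gamma)}\right]\operatorname{tr}(\Sigma_{k-1}).$$

□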


^ ^ uIq O 


c=l 

amount of power in total. 


f ln|C| 
V^ + ^o 


□ 


Note that | gk | can be computed recursively. We may derive a 
recursion. Let Zk = al(x - Vk-i) + ujk = Vk - (^Ivk-i^ Also 
Let gk = a'^ijUk - V'k)- Note that gk = a^^k for ik = j^k - l^k 
in Based on the recursion for in that we derived 
earlier, we have 


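Proof of Theorem 5. Let $\varepsilon \geq \sqrt{\|\Sigma_K\|\chi_n^2(p)}$, i.e., $\|\Sigma_K\| \leq \chi_{n,p,\varepsilon}$.
Then Theorem 5 follows from

$$\mathbb{P}\big[\|x - \mu_K\|_2 \leq \varepsilon\big] \geq \mathbb{P}_{x \sim \mathcal{N}(\mu_K, \Sigma_K)}\Big[\|x - \mu_K\|_2 \leq \sqrt{\|\Sigma_K\|\chi_n^2(p)}\Big] \geq \mathbb{P}_{x \sim \mathcal{N}(\mu_K, \Sigma_K)}\big[(x - \mu_K)^\top \Sigma_K^{-1}(x - \mu_K) \leq \chi_n^2(p)\big] = p. \tag{27}$$

From the trace recursion (14), when the powers $\beta_i$ are sufficiently
large, we have

$$\|\Sigma_K\| \leq \operatorname{tr}(\Sigma_K) \leq \left(1 - \frac{1}{n(1+\gamma)}\right)^{K} \operatorname{tr}(\Sigma).$$

Hence for (27) to hold, we can simply require
$\left(1 - \frac{1}{n(1+\gamma)}\right)^{K} \operatorname{tr}(\Sigma) \leq \chi_{n,p,\varepsilon}$, or equivalently (15) in Theorem 5. □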
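The equivalence with (15) follows by taking logarithms. As a short worked step, write $q = 1 - 1/[n(1+\gamma)]$ (a shorthand introduced here, not in the original), so that $0 < q < 1$ and $\ln q < 0$:

```latex
\begin{align*}
q^{K}\operatorname{tr}(\Sigma) \le \chi_{n,p,\varepsilon}
&\iff K \ln q \le \ln\frac{\chi_{n,p,\varepsilon}}{\operatorname{tr}(\Sigma)} \\
&\iff K \ge \frac{\ln[\operatorname{tr}(\Sigma)/\chi_{n,p,\varepsilon}]}{\ln(1/q)}
      \qquad (\text{dividing by } \ln q < 0 \text{ flips the inequality}).
\end{align*}
```

Because $\ln(1/q) \approx 1/[n(1+\gamma)]$ for large $n$, the required number of one-sparse measurements grows roughly like $n(1+\gamma)\ln[\operatorname{tr}(\Sigma)/\chi_{n,p,\varepsilon}]$.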


Qk = 


a 


Pk^k + 


[Qk-l + 


and 


\Qk\ < 


^k{Pk/cr^) + 1 


\Qk-l\ + 


alEk-iOkiVk - alvk-i) ^ 

Pk^k + - a\Ek-iak 

^k 


(A/c -4) Ecj^/Pk 


k^|]. 


Proof of Lemma^ The recursion of the diagonal entries can 
be written as 








2+7+. +<'VA 


-^(^k l)^(/c 1) 


s;::,7.(i-4:?)+sr‘vv/3. 


(/c-1) 2 ^ 


2+71-.+ »VA 


Note that for i = 4_i, 

^(k) 


jk-ljk-1 y^{k — l) \ 2 10 ~ 1 ^ ^ Ifc-I-Z'fc-I ’ 